Developments in the theory of randomized 
shortest paths with a comparison of graph node 
distances 

Ilkka Kivimäki*, Masashi Shimbo**, and Marco Saerens*

* Université catholique de Louvain - Louvain-la-Neuve, Belgium

** Nara Institute of Science and Technology - Ikoma, Japan

Abstract 


There have lately been several suggestions for parametrized distances on a graph that generalize the
shortest path distance and the commute time or resistance distance. The need for developing such
distances has arisen from the observation that the above-mentioned common distances in many situations
fail to take into account the global structure of the graph. In this article, we develop the theory of one
family of graph node distances, known as the randomized shortest path dissimilarity, which we show
to be easily computable in closed form for all pairs of nodes of a graph. Moreover, we come up with a
new definition of a distance measure that we call the free energy distance. The free energy distance can
be seen as an upgrade of the randomized shortest path dissimilarity as it satisfies several nice properties
for a distance. In addition, the derivation and computation of the free energy distance are quite
straightforward. We also make a comparison between a set of generalized distances that interpolate between
the shortest path distance and the commute time, or resistance distance. This comparison focuses on the
applicability of the distances in graph node clustering.

1 Introduction

Defining distances and similarities between nodes of a graph based on its structure has become
an essential task in the analysis of network data. In the simplest
case a binary network can be presented as an adjacency matrix or adjacency list, which can be
difficult to interpret. Acquiring meaningful information from such data requires sophisticated
methods which often need to be chosen based on the context. Being able to measure the distance
between the nodes of a network in a meaningful way provides a fundamental
way of interpreting the network. With the information of distances between the nodes, one
can apply traditional multivariate statistical or machine learning methods for analyzing the
data.

The most common ways of defining a distance on a graph are to consider either the lengths of
the shortest paths between nodes, leading to the definition of the shortest path (SP) distance, or
the expected lengths of random walks on the graph, which can be used to derive the commute
time (CT) distance [20]. The CT distance is known to equal the resistance distance up to
a constant factor [6]. In this paper, we examine generalized distances on graphs that interpolate,
depending on a parameter, between the shortest path distance and the commute time or
resistance distance.

The paper contains several separate contributions: First, we develop the theory of one generalized
distance, the randomized shortest path (RSP) dissimilarity [39, 31]. We derive a new
algorithm for computing it for all pairs of nodes of a graph in closed form, and thus much
more efficiently than before. We then derive another generalized distance from the RSP framework
based on the Helmholtz free energy between two states of a thermodynamic system. We
show that this free energy (FE) distance actually coincides with the potential distance, proposed 
in recent literature in a more ad hoc manner [18]. However, our new derivation gives a nice 
theoretical background for this distance. Finally, we make a comparison of the behavior and 
performance of different generalized graph node distances. The comparisons are conducted 
by observing the relative differences of distances between nodes in small example graphs and 
by examining the performance of the different distance measures in clustering tasks. 

The paper is structured as follows: In Section 2, we define the terms and notation used in the
paper. In our framework, we consider graphs where the edges can be assigned weights and
costs, which can be independent of each other. In Section 3, we recall the definitions of the
common distances on graphs. We also present a surprising result related to the generalization
of the commute time distance considering costs, namely that the distance based on costs
equals the commute time distance, up to a constant factor. In Section 4, we revisit the definition
of the RSP dissimilarity [39, 31]. We then derive the closed form algorithm, mentioned
above, for computing it, and then formulate the definition of the FE distance. In Section 5,
we present other parametrized distances on graphs interpolating between the SP and CT distances
that have been defined in recent literature. Section 6 contains the comparison of the
RSP dissimilarity, the FE distance and the generalized distances defined in Section 5. Finally,
Section 7 sums up the content of the article.

2 Terminology and notation 

We first go through the terminology and notation used in this paper. We denote by $G = (V, E)$
a graph consisting of a node set $V = \{1, 2, \dots, n\}$ and an edge set $E = \{(i, j)\}$. Nodes $i$ and
$j$ such that $(i, j) \in E$ are called adjacent or connected. Each graph can be represented as an
adjacency matrix $\mathbf{A}$, whose elements $a_{ij}$ are called affinities, or weights, interchangeably.
For unweighted graphs $a_{ij} = 1$ if $(i, j) \in E$, for weighted graphs $a_{ij} > 0$ if $(i, j) \in E$, and in
both cases $a_{ij} = 0$ if $(i, j) \notin E$. The affinities can be interpreted as representing the degree
of similarity between connected nodes. A path, or walk, interchangeably, on the graph $G$ is
a sequence of nodes $\wp = (i_0, \dots, i_T)$, where $T \ge 0$ and $(i_\tau, i_{\tau+1}) \in E$ for all $\tau = 0, \dots, T-1$.
The length of the path, or walk, $\wp$ is then $T$. Note that throughout this article we include
zero-length paths $(i)$, $i \in V$, in the definition of a path, although in some contexts it may be
more appropriate to disallow this by setting $T \ge 1$ in the definition. Moreover, we define
absorbing, or hitting paths as paths which contain the terminal node only once. Thus a path $\wp$
is an absorbing path if $\wp = (i_0, \dots, i_T)$, where $i_\tau \ne i_T$ for all $\tau = 0, \dots, T-1$.

In addition to affinities, the edges of a graph can be assigned costs, $c_{ij}$, such that $0 < c_{ij} < \infty$
if $(i, j) \in E$. In principle, we do not define costs for unconnected pairs of nodes, but
when making matrix computations, we assign the corresponding matrix elements a very large
number, i.e. a number close to the maximum computational limits of a computer. A common
convention is to define the costs as reciprocals of the affinities, $c_{ij} = 1/a_{ij}$. This applies both
for unweighted and weighted graphs. This way the edge weights and costs are analogous to
conductance and resistance, respectively, in an electric network. However, the costs can also
be assigned independently of the affinities, allowing a more general setting. This can be useful
in many applications because links can often have a two-sided nature, on one hand based on
the structure of the graph and on the other hand based on internal features of the edges. One
such example can be a toll road network, where the affinities represent the proximities of
places and the costs represent toll costs of traversing a road. This interpretation is especially
useful in graph analysis based on a probabilistic framework, which is also the emphasis of this
paper. Finally, we define the cost of a path $\wp$, denoted $\tilde{c}(\wp)$, as the sum of the costs of the
edges along the path.¹

We denote by $\mathbf{e}$ the $n \times 1$ vector whose elements are all 1. For an $n \times n$ square matrix $\mathbf{A}$,
let $\mathrm{Diag}(\mathbf{A})$ denote the $n \times n$ diagonal matrix whose diagonal elements are the diagonal
elements of $\mathbf{A}$. Likewise, for an $n \times 1$ vector $\mathbf{v}$, $\mathrm{Diag}(\mathbf{v})$ denotes the $n \times n$ diagonal matrix with
diagonal elements from $\mathbf{v}$. We use $\exp(\mathbf{A})$ and $\log(\mathbf{A})$ to denote the elementwise exponential
and logarithm, respectively; these should not be confused with the matrix exponential and
matrix logarithm, which are not used in this article. Furthermore, we use $\mathbf{A} \circ \mathbf{B}$ and $\mathbf{A} \div \mathbf{B}$ for
elementwise product and division, respectively, of $n \times m$ matrices $\mathbf{A}$ and $\mathbf{B}$.



3 The shortest path and commute time distances 

The most common distance measure between two nodes of a graph is the shortest path (SP)
distance. As introduced earlier in Section 2, in our framework we consider costs associated
with the edges of a graph. Hence, we define the SP distance between two nodes as the minimal
cost of a path between the nodes. This applies for both unweighted and weighted undirected
graphs. Also recall that edge costs can be independent of the affinities $a_{ij}$. Thus, our definition
of the SP distance does not necessarily depend on the affinities, either, but only on the costs.
In addition, we define the unweighted SP distance between two nodes as the minimal length of
a path between the nodes.²
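
As an illustration only, the cost-based and unweighted SP distances defined above can be sketched with the Floyd-Warshall recursion. The function name `shortest_path_distances`, the example graph, and the convention that a zero entry of `A` marks a missing edge are assumptions made for this sketch and do not come from the original text.

```python
import numpy as np

def shortest_path_distances(A, C):
    """All-pairs minimal path cost; edges are the nonzero entries of A (assumption)."""
    n = A.shape[0]
    # Non-edges get infinite cost so that they are never used on a path.
    D = np.where(A > 0, C, np.inf).astype(float)
    np.fill_diagonal(D, 0.0)  # zero-length paths (i) have zero cost
    for k in range(n):
        # Allow node k as an intermediate node on every path.
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

# Hypothetical 4-node cycle; the direct edge (0, 3) exists but is expensive.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
C = np.array([[0, 1, 0, 5],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [5, 0, 1, 0]], dtype=float)

SP = shortest_path_distances(A, C)             # SP distance based on costs
SP_unweighted = shortest_path_distances(A, A)  # unit costs: minimal path length
```

In this example the cost-based SP distance between nodes 0 and 3 follows the three cheap edges rather than the direct expensive one, whereas the unweighted SP distance between the same nodes is a single step.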

The SP distance can be used, for example, for estimating the geodesic distance between points 
when assuming that the graph points lie on a manifold. One popular method to use this idea 
is the Isomap algorithm [34] for nonlinear dimensionality reduction. One major drawback of 
the SP distance is that it does not take into account the global structure of the network. In 
particular, it does not consider the number of connections that exist between nodes, only the 
length of the shortest one. 

Another interesting and well-known graph distance measure is the commute time (CT) distance
[20], which is defined between two nodes as the expected length of paths that a random
walker moving along the edges of the graph has to take from one node to the other and back.
The transition probability $p^{\mathrm{ref}}_{ij}$ of the walker moving from a node $i$ to an adjacent node $j$ is
given conventionally as

\[ p^{\mathrm{ref}}_{ij} = \frac{a_{ij}}{\sum_{k} a_{ik}}. \tag{1} \]

The CT distance is well known to be proportional to the resistance distance [6], which is defined
as the effective resistance of a network when it is considered as an electric circuit where the
poles of a unit volt battery have been attached to the nodes between which the distance is
being measured.

¹ Throughout the article we will use the tilde (~) to differentiate quantities related to paths from quantities related
to edges.

² Some authors, e.g. in [7], instead call the SP distance based on the edge weights the weighted SP distance and
use the term SP distance only for the distance based on the number of edges on paths. However, there the costs (or
resistances) are fixed as the reciprocals of affinities, unlike in our approach.

We can also define a generalization of the commute time distance that considers costs of paths
instead of their lengths. More precisely, we define the commute cost (CC) distance as the expected
cost of the paths that a random walker will take when moving from a node to another and
back according to the transition probabilities $p^{\mathrm{ref}}_{ij}$ [33]. An interesting, somewhat unintuitive
result in this context is that in an undirected graph, the commute cost distance is proportional to
the commute time distance. We provide the proof for this result in Appendix A. Here, it is
important to remember that the costs are independent of the weights and vice versa. Thus the
same applies between the costs and the transition probabilities of the random walker. This
result means that the commute time, commute cost and resistance distances are all the same
up to a constant factor. Thus, in most practical applications they will give the exact same
results, because in practice the interest lies in the ratios of pairwise distances instead of the
distances themselves.
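
The proportionality between the resistance, commute time and commute cost distances can be checked numerically on a small graph. The following is a minimal sketch under stated assumptions: the function names `resistance_distance` and `commute_cost`, the example matrices, and the use of the Laplacian pseudoinverse for the effective resistance are choices made here for illustration, not the procedure of Appendix A. Non-edges carry a zero in `C`, which is harmless because their transition probability is zero.

```python
import numpy as np

def resistance_distance(A):
    """Effective resistances from the pseudoinverse of the Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    L_pinv = np.linalg.pinv(L)
    d = np.diag(L_pinv)
    return d[:, None] + d[None, :] - 2.0 * L_pinv

def commute_cost(A, C):
    """Expected cost of going from i to j and back under the natural random walk."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # Equation (1)
    step_cost = (P * C).sum(axis=1)        # expected cost of one step from each node
    H = np.zeros((n, n))                   # H[i, t]: expected cost of reaching t from i
    for t in range(n):
        keep = np.arange(n) != t
        M = np.eye(n - 1) - P[np.ix_(keep, keep)]
        H[keep, t] = np.linalg.solve(M, step_cost[keep])
    return H + H.T

# Hypothetical weighted graph with costs chosen independently of the affinities.
A = np.array([[0, 2, 1, 0],
              [2, 0, 1, 0],
              [1, 1, 0, 3],
              [0, 0, 3, 0]], dtype=float)
C = np.array([[0, 1, 4, 0],
              [1, 0, 2, 0],
              [4, 2, 0, 0.5],
              [0, 0, 0.5, 0]], dtype=float)

R = resistance_distance(A)
CT = commute_cost(A, (A > 0).astype(float))  # unit cost per step gives the commute time
CC = commute_cost(A, C)                      # arbitrary edge costs give the commute cost

off = ~np.eye(A.shape[0], dtype=bool)
ratio_ct = CT[off] / R[off]    # constant over pairs if CT is proportional to R
ratio_cc = CC[off] / CT[off]   # constant over pairs if CC is proportional to CT
```

Both ratio vectors should be constant over all node pairs, so any method that depends only on ratios of pairwise distances would treat the three distances identically.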

A nice thing about the commute time, commute cost and resistance distances, when compared
to the SP distance, is that they take into account the number of different paths connecting
pairs of nodes. As a result, these distances have been utilized in different applications of
network science with beneficial results. However, it has been noted that in a large graph these
distances are affected largely by the stationary distribution of the natural random walk on
the graph [5]. Finally, von Luxburg et al. [36] showed that in certain models, as the size of a
graph grows, the resistance distance (and thus the CT and CC distances as well) between two
nodes becomes dependent only on trivial local properties of the nodes. More specifically, the
resistance distance between two nodes approaches the sum of the reciprocals of the degrees
of these two nodes.

An intuitive explanation of this phenomenon is that in very large graphs a random walker has 
too many paths to follow and the chance of the walker finding its destination node becomes 
more dependent on the number of edges (instead of paths, per se) that lead to the node. This 
undesirable phenomenon serves as one motivation for defining new graph node distances 
that offer a compromise between the SP and CT distances. This idea already appeared in
the development of the RSP dissimilarity [39, 31], with the main motivations in path planning
and simply in proposing a distance interpolating between the SP and CT distances. In
the following, we first recapitulate the definition of the RSP dissimilarity and then develop 
the theory behind it. After this we will review other generalized graph node distances and 
compare their use in machine learning. 

4 Advances in the randomized shortest paths framework 

The RSP dissimilarity was defined in [39], inspired by [1], and its theory has been extended
further in [31] and [19]. It is based on the interpretation of random walks in terms of statistical
physics. The definition involves a parameter $\theta$ which is analogous to the inverse temperature
of a thermodynamical system. The RSP dissimilarity is shown to converge to the SP distance
as $\theta \to \infty$ and to the CT distance as $\theta \to 0^+$.

The reason why the RSP dissimilarity is called a dissimilarity, rather than a distance, is that for
intermediate values of the parameter $\theta$, it does not satisfy the triangle inequality, meaning that
it is only a semimetric. In this paper we focus on the effect of the choice of a distance measure
on clustering. When studying clustering algorithms, it is often assumed that they are used in
conjunction with a metric, i.e., a distance measure that satisfies the triangle inequality. Also,
the triangle inequality can be used to improve the efficiency of some distance-based algorithms,
cf. [14]. However, we only focus in this paper on kernel k-means clustering, which works well
even with a semimetric. Furthermore, it has already been shown that using the RSP dissimilarity
with its intermediate parameter values provides good results in graph node clustering
and semi-supervised learning tasks [39].



4.1 The randomized shortest path dissimilarity 

The RSP dissimilarity is defined by considering a random walker choosing an absorbing, or
hitting, path from a source node $s$ to a destination node $t$, meaning that the node $t$ can appear
in the path only once, as the ending node. Let $\mathcal{P}_{st}$ denote the set of such paths and let
$\wp = (i_0 = s, \dots, i_T = t) \in \mathcal{P}_{st}$. The reference probability of the path $\wp$ is
$\tilde{P}^{\mathrm{ref}}_{st}(\wp) = p^{\mathrm{ref}}_{i_0 i_1} \cdots p^{\mathrm{ref}}_{i_{T-1} t}$. It simply
corresponds to the likelihood of the path, i.e., the product of the transition probabilities.

In the RSP model, the randomness of the walker is constrained by fixing the relative entropy
between the distribution over paths according to the reference probabilities and the distribution
over paths that the walker actually chooses from. With this constraint, the walker then
chooses the walk from the probability distribution that minimizes the expected cost³

\[ \bar{c}(\tilde{P}_{st}) = \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp)\, \tilde{c}(\wp) \]

of going from node $s$ to node $t$. Thus, the relative entropy constraint controls the exploration of
the walker, whereas the minimization of expected cost controls its exploitation. Formally, the
walker moves according to the distribution

\[ \tilde{P}^{\mathrm{RSP}}_{st} = \arg\min_{\tilde{P}_{st}} \bar{c}(\tilde{P}_{st}) \quad \text{subject to} \quad
\begin{cases}
\sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp) \log\big(\tilde{P}_{st}(\wp) / \tilde{P}^{\mathrm{ref}}_{st}(\wp)\big) = J_0, \\
\sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp) = 1.
\end{cases} \]

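The minimization is shown [39] to result in a Boltzmann distribution

\[ \tilde{P}^{\mathrm{RSP}}_{st}(\wp) = \frac{\tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp))}{\sum_{\wp' \in \mathcal{P}_{st}} \tilde{P}^{\mathrm{ref}}_{st}(\wp') \exp(-\theta \tilde{c}(\wp'))}, \tag{2} \]

where the inverse temperature parameter $\theta$ controls the influence of the cost on the walker's
selection of a path. When applying the model, the user is assumed to provide $\theta$ as an input
parameter instead of the relative entropy $J_0$.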
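
The text relies on [39] for this minimization. As a reading aid, the Boltzmann form can be recovered with a standard Lagrange-multiplier argument; the multipliers $\lambda$ and $\mu$ below are introduced here for illustration and are not notation from the source.

```latex
\begin{align*}
\mathcal{L} &= \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp)\,\tilde{c}(\wp)
  + \lambda \Big( \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp)
      \log \frac{\tilde{P}_{st}(\wp)}{\tilde{P}^{\mathrm{ref}}_{st}(\wp)} - J_0 \Big)
  + \mu \Big( \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp) - 1 \Big), \\
0 = \frac{\partial \mathcal{L}}{\partial \tilde{P}_{st}(\wp)}
  &= \tilde{c}(\wp)
  + \lambda \Big( \log \frac{\tilde{P}_{st}(\wp)}{\tilde{P}^{\mathrm{ref}}_{st}(\wp)} + 1 \Big) + \mu
  \;\Longrightarrow\;
  \tilde{P}_{st}(\wp) \propto \tilde{P}^{\mathrm{ref}}_{st}(\wp)
      \exp\!\big( -\tilde{c}(\wp) / \lambda \big).
\end{align*}
```

Setting $\theta = 1/\lambda$ and normalizing over $\mathcal{P}_{st}$ gives the Boltzmann distribution; the value of $\lambda$, and hence of $\theta$, is the one for which the relative entropy constraint holds.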

After deriving the optimal distribution for minimizing the expected cost, the authors define
the RSP dissimilarity between the nodes $s$ and $t$ as this expected cost (after symmetrization),
formally

\[ \Delta^{\mathrm{RSP}}_{st} = \big( \bar{c}(\tilde{P}^{\mathrm{RSP}}_{st}) + \bar{c}(\tilde{P}^{\mathrm{RSP}}_{ts}) \big) / 2. \]

The authors develop an algorithm for computing the expected cost $\bar{c}(\tilde{P}^{\mathrm{RSP}}_{st})$, which is not at all a
trivial task. In the next section we develop a new, more efficient algorithm for computing the
expected costs and thus the matrix $\boldsymbol{\Delta}^{\mathrm{RSP}}$ of the RSP dissimilarities between all pairs of nodes
of a graph in closed form.



³ Notice the difference in notation between the expected cost (denoted with a bar as $\bar{c}$) and the cost of a particular
path (denoted with a tilde as $\tilde{c}$).

4.2 Algorithm for faster computation of $\boldsymbol{\Delta}^{\mathrm{RSP}}$

We now show how to compute the RSP dissimilarity and then develop an algorithm that
allows computing the set of all pairwise RSP dissimilarities between the nodes of a graph in a
batch mode. The algorithm in the original reference [39] performs a loop over all the nodes of
the graph and computes the needed quantities considering the node as absorbing. Our new
algorithm is based solely on matrix manipulations and can thus provide faster execution than
naive looping.

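The computation of the expected cost $\bar{c}(\tilde{P}^{\mathrm{RSP}}_{st})$ is based on considering the denominator of the
right side of Equation (2) and denoting

\[ z^h_{st} = \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp)). \tag{3} \]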

In statistical physics, this quantity is called the partition function of a thermodynamical system.
In our case, the system consists of the paths in $\mathcal{P}_{st}$. The partition function is essential
for deriving different quantities related to the RSP framework. Indeed, by manipulating the
expected cost of travelling from node $s$ to node $t$ we see that

\[ \bar{c}(\tilde{P}^{\mathrm{RSP}}_{st})
 = \sum_{\wp \in \mathcal{P}_{st}} \frac{\tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp))}{z^h_{st}}\, \tilde{c}(\wp)
 = -\frac{1}{z^h_{st}} \frac{\partial z^h_{st}}{\partial \theta}
 = -\frac{\partial \log z^h_{st}}{\partial \theta}, \tag{4} \]

meaning that the expected cost can be obtained by taking the derivative of the logarithm of
the partition function.

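Let us denote by $\mathbf{C}$ the matrix of costs on edges, $c_{ij}$, and by $\mathbf{P}^{\mathrm{ref}}$ the transition probability
matrix of the natural random walk associated to the graph $G$, containing the elements $p^{\mathrm{ref}}_{ij}$. The
latter can be computed from the adjacency matrix as $\mathbf{P}^{\mathrm{ref}} = \mathbf{D}^{-1}\mathbf{A}$, where $\mathbf{D} = \mathrm{Diag}(\mathbf{A}\mathbf{e})$. In
order to compute the partition function, we define a new matrix

\[ \mathbf{W} = \mathbf{P}^{\mathrm{ref}} \circ \exp(-\theta \mathbf{C}). \]

Since the costs are positive, every nonzero element of $\mathbf{W}$ is strictly smaller than the corresponding
element of $\mathbf{P}^{\mathrm{ref}}$ when $\theta > 0$. Hence the matrix $\mathbf{W}$ is substochastic, and thus it can be interpreted
as a new transition matrix defining an evaporating, or killing random walk on the graph [31].
This means that at each step of the walk the random walker has a non-zero probability of
stopping its walk, i.e. evaporating.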

Remember now that we want to make the destination node $t$ absorbing. For this, we define
a new matrix by setting the row $t$ of $\mathbf{W}$ to zero: $\mathbf{W}_h = \mathbf{W} - \mathbf{e}_t \mathbf{w}_t^T$, where $\mathbf{e}_t$ is a vector
containing 1 in element $t$ and 0 elsewhere and $\mathbf{w}_t$ is row $t$ of matrix $\mathbf{W}$ as a column vector.

The powers of this matrix, $(\mathbf{W}_h)^\tau$, contain in element $(s, t)$ the probability that a killing random
walk of exactly $\tau$ steps leaving from node $s$ ends up in node $t$ when obeying the transition
probabilities assigned by $\mathbf{W}_h$. This can also be expressed as

\[ [(\mathbf{W}_h)^\tau]_{st} = \sum_{\wp \in \mathcal{P}_{st}(\tau)} \tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp)), \]

where $\mathcal{P}_{st}(\tau)$ denotes the set of paths of exactly length $\tau$ going from node $s$ to node $t$. Then
by summing over all walk lengths⁴ $\tau$ we can cover all hitting paths from node $s$ to $t$ and write
the partition function as a power series

\[ z^h_{st} = \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp))
 = \sum_{\tau=0}^{\infty} \sum_{\wp \in \mathcal{P}_{st}(\tau)} \tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp))
 = \sum_{\tau=0}^{\infty} [(\mathbf{W}_h)^\tau]_{st}
 = [(\mathbf{I} - \mathbf{W}_h)^{-1}]_{st}. \]

The series converges to the matrix $\mathbf{Z}_h = (\mathbf{I} - \mathbf{W}_h)^{-1}$ as the spectral radius of $\mathbf{W}_h$ is less than
one, $\rho(\mathbf{W}_h) < 1$. The matrix $\mathbf{Z}_h$ is the fundamental matrix corresponding to the killing Markov
chain with the transition matrix $\mathbf{W}_h$.
In the original reference [39], the authors then use the Sherman-Morrison update rule for
deriving the form

\[ \mathbf{Z}_h = \mathbf{Z} - \frac{\mathbf{Z} \mathbf{e}_t \mathbf{w}_t^T \mathbf{Z}}{1 + \mathbf{w}_t^T \mathbf{Z} \mathbf{e}_t}, \]

where $\mathbf{Z} = (\mathbf{I} - \mathbf{W})^{-1}$ and $\mathbf{w}_t$ is row $t$ of $\mathbf{W}$ as a column vector. This form appears to depend
on the absorbing node $t$. They then use this form of $\mathbf{Z}_h$ for computing
the dissimilarities from each node of the graph to one fixed node $t$ at a time. In order to
compute all dissimilarities in the graph, the algorithm just loops over all the nodes of the
graph considering them as absorbing one at a time. Note that the matrix $\mathbf{I} - \mathbf{W}$
needs to be inverted only once throughout the process.

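However, it is shown in Appendix A of [18], where the authors define a probabilistic model
called the bag-of-paths framework, that in fact the above expression transforms further
into

\[ \mathbf{Z}_h = \mathbf{Z} - \frac{\mathbf{Z} \mathbf{e}_t \big( \mathbf{z}_t^T - \mathbf{e}_t^T \big)}{z_{tt}}, \]

where $\mathbf{z}_t$ denotes row $t$ of $\mathbf{Z}$ as a column vector.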

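When this is used for determining the element $(s, t)$ of the matrix $\mathbf{Z}_h$, it turns out that

\[ z^h_{st} = (\mathbf{Z}_h)_{st} = \frac{z_{st}}{z_{tt}}, \tag{5} \]

meaning that the whole matrix can be computed using $\mathbf{Z}$ simply as $\mathbf{Z}_h = \mathbf{Z} \mathbf{D}_h^{-1}$, where
$\mathbf{D}_h = \mathrm{Diag}(\mathbf{Z})$ is the diagonal matrix with elements $z_{tt}$ on its diagonal.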
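
The closed form $z^h_{st} = z_{st}/z_{tt}$ can be compared numerically with the direct construction in which each node is made absorbing in turn. This is a minimal sketch; the helper `killed_walk_matrix`, the example graph, unit edge costs and the value `theta = 0.7` are assumptions made here for illustration.

```python
import numpy as np

def killed_walk_matrix(A, C, theta):
    """W = P_ref o exp(-theta * C), with P_ref = D^{-1} A."""
    P_ref = A / A.sum(axis=1, keepdims=True)
    return P_ref * np.exp(-theta * C)

# Hypothetical connected graph with unit costs on its edges.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
C = A.copy()
theta = 0.7
n = A.shape[0]

W = killed_walk_matrix(A, C, theta)
Z = np.linalg.inv(np.eye(n) - W)

# Closed form: column t of Z divided by z_tt, i.e. Z D_h^{-1}.
Zh_closed = Z / np.diag(Z)[None, :]

# Direct route: make each node t absorbing in turn and invert again.
Zh_direct = np.empty((n, n))
for t in range(n):
    W_h = W.copy()
    W_h[t, :] = 0.0                                   # row t of W set to zero
    Zh_direct[:, t] = np.linalg.inv(np.eye(n) - W_h)[:, t]

max_gap = np.abs(Zh_closed - Zh_direct).max()         # should be at rounding level
row_sums = W.sum(axis=1)                              # all below 1: W is substochastic
spectral_radius = np.abs(np.linalg.eigvals(W)).max()  # below 1: the series converges
```

The direct route needs one matrix inversion per absorbing node, whereas the closed form reuses the single inverse $\mathbf{Z}$, which is the saving exploited in the remainder of this section.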

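Now we can finally derive the new matrix formula for the RSP dissimilarities between all
pairs of nodes using Equations (4) and (5). The expected cost is given by

\[ \bar{c}(\tilde{P}^{\mathrm{RSP}}_{st})
 = -\frac{\partial \log z^h_{st}}{\partial \theta}
 = -\frac{\partial \log(z_{st}/z_{tt})}{\partial \theta}
 = -\frac{\partial \log z_{st}}{\partial \theta} + \frac{\partial \log z_{tt}}{\partial \theta}. \tag{6} \]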



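⁴ Although in [39] only paths of length $\ge 1$ are considered, we also include paths of length 0, i.e. the paths that
consist of only one node and no links, into the set of allowed paths; see [18] for a discussion related to this.

Algorithm 1 Computation of the matrix of all pairwise RSP dissimilarities of $G$.

Require:
- A graph $G$ containing $n$ nodes.
- The $n \times n$ adjacency matrix $\mathbf{A}$ associated to $G = (V, E)$, containing affinities.
- The $n \times n$ cost matrix $\mathbf{C}$ associated to $G$.
- The inverse temperature parameter $\theta$.

Ensure:
- The $n \times n$ matrix $\boldsymbol{\Delta}^{\mathrm{RSP}}$ containing the RSP dissimilarities $\Delta^{\mathrm{RSP}}_{ij}$ for all $i, j \in V$.

1. $\mathbf{D} \leftarrow \mathrm{Diag}(\mathbf{A}\mathbf{e})$ {the row-normalization matrix}
2. $\mathbf{P}^{\mathrm{ref}} \leftarrow \mathbf{D}^{-1}\mathbf{A}$ {the reference transition probabilities matrix}
3. $\mathbf{W} \leftarrow \mathbf{P}^{\mathrm{ref}} \circ \exp[-\theta \mathbf{C}]$ {elementwise exponential and multiplication $\circ$}
4. $\mathbf{Z} \leftarrow (\mathbf{I} - \mathbf{W})^{-1}$ {the fundamental matrix}
5. $\mathbf{S} \leftarrow (\mathbf{Z}(\mathbf{C} \circ \mathbf{W})\mathbf{Z}) \div \mathbf{Z}$ {the $\mathbf{S}$ matrix, with elementwise division $\div$}
6. $\mathbf{d}_S \leftarrow \mathrm{Diag}(\mathbf{S})\mathbf{e}$ {the vector of diagonal values of $\mathbf{S}$}
7. $\bar{\mathbf{C}} \leftarrow \mathbf{S} - \mathbf{e}\mathbf{d}_S^T$ {the matrix of expected costs}
8. $\boldsymbol{\Delta}^{\mathrm{RSP}} \leftarrow (\bar{\mathbf{C}} + \bar{\mathbf{C}}^T)/2$ {the dissimilarity matrix by symmetrization}
9. return $\boldsymbol{\Delta}^{\mathrm{RSP}}$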
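
The steps of Algorithm 1 map almost one-to-one onto NumPy operations. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the example matrices and `theta = 1.0` are hypothetical, the graph is assumed connected (so that every element of `Z` is nonzero), and non-edges carry a zero in `C`, which is harmless because the corresponding element of `W` is zero. The sketch also compares the result with a finite-difference derivative of $-\log z^h_{st}$, as suggested by Equation (4).

```python
import numpy as np

def rsp_expected_costs(A, C, theta):
    """Directed expected costs (steps 1-7 of Algorithm 1); assumes a connected graph."""
    n = A.shape[0]
    P_ref = A / A.sum(axis=1, keepdims=True)   # steps 1-2: D^{-1} A
    W = P_ref * np.exp(-theta * C)             # step 3: elementwise product
    Z = np.linalg.inv(np.eye(n) - W)           # step 4: fundamental matrix
    S = (Z @ (C * W) @ Z) / Z                  # step 5: elementwise division by Z
    return S - np.diag(S)[None, :]             # steps 6-7: subtract s_tt from column t

def rsp_dissimilarity(A, C, theta):
    """Step 8: symmetrize the expected costs."""
    C_bar = rsp_expected_costs(A, C, theta)
    return (C_bar + C_bar.T) / 2.0

def neg_log_zh(A, C, theta):
    """-log z^h_st = -log(z_st / z_tt); used only for the numerical check below."""
    n = A.shape[0]
    W = (A / A.sum(axis=1, keepdims=True)) * np.exp(-theta * C)
    Z = np.linalg.inv(np.eye(n) - W)
    return -np.log(Z / np.diag(Z)[None, :])

# Hypothetical weighted graph with costs chosen independently of the affinities.
A = np.array([[0, 2, 1, 0],
              [2, 0, 1, 0],
              [1, 1, 0, 3],
              [0, 0, 3, 0]], dtype=float)
C = np.array([[0, 1, 4, 0],
              [1, 0, 2, 0],
              [4, 2, 0, 0.5],
              [0, 0, 0.5, 0]], dtype=float)
theta = 1.0

Delta_rsp = rsp_dissimilarity(A, C, theta)

# Central finite difference of -log z^h with respect to theta, cf. Equation (4).
h = 1e-5
fd = (neg_log_zh(A, C, theta + h) - neg_log_zh(A, C, theta - h)) / (2 * h)
fd_gap = np.abs(fd - rsp_expected_costs(A, C, theta)).max()
```

For large values of `theta` the elements of `W` become very small, so a practical implementation may need extra care with numerical underflow; the sketch does not address this.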



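The first term can be computed by noting that

\begin{align*}
\frac{\partial \log z_{st}}{\partial \theta}
 &= \frac{1}{z_{st}} \frac{\partial z_{st}}{\partial \theta}
  = \frac{1}{z_{st}} \frac{\partial (\mathbf{e}_s^T \mathbf{Z} \mathbf{e}_t)}{\partial \theta}
  = \frac{1}{z_{st}} \mathbf{e}_s^T \frac{\partial (\mathbf{I} - \mathbf{W})^{-1}}{\partial \theta} \mathbf{e}_t \\
 &= -\frac{1}{z_{st}} \mathbf{e}_s^T (\mathbf{I} - \mathbf{W})^{-1} \frac{\partial (\mathbf{I} - \mathbf{W})}{\partial \theta} (\mathbf{I} - \mathbf{W})^{-1} \mathbf{e}_t \\
 &= \frac{1}{z_{st}} \mathbf{e}_s^T \mathbf{Z} \frac{\partial \mathbf{W}}{\partial \theta} \mathbf{Z} \mathbf{e}_t \\
 &= -\frac{1}{z_{st}} \mathbf{e}_s^T \mathbf{Z} (\mathbf{C} \circ \mathbf{W}) \mathbf{Z} \mathbf{e}_t,
\end{align*}

where we used $\partial \mathbf{W} / \partial \theta = \partial (\mathbf{P}^{\mathrm{ref}} \circ \exp(-\theta \mathbf{C})) / \partial \theta = -(\mathbf{C} \circ \mathbf{W})$.
Thus, we can write Equation (6) as

\[ \bar{c}(\tilde{P}^{\mathrm{RSP}}_{st})
 = \frac{\mathbf{e}_s^T \mathbf{Z} (\mathbf{C} \circ \mathbf{W}) \mathbf{Z} \mathbf{e}_t}{z_{st}}
 - \frac{\mathbf{e}_t^T \mathbf{Z} (\mathbf{C} \circ \mathbf{W}) \mathbf{Z} \mathbf{e}_t}{z_{tt}}. \tag{7} \]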

Let us then denote $\mathbf{S} = (\mathbf{Z}(\mathbf{C} \circ \mathbf{W})\mathbf{Z}) \div \mathbf{Z}$, where $\div$ marks elementwise division. In fact, $\mathbf{S}$ is the
matrix form of the first term on the right side of Equation (7), and contains the expected costs
of non-hitting random walks. We can now use it to write out the matrix form of computing
all the expected costs, or directed dissimilarities, as

\[ \bar{\mathbf{C}} = \mathbf{S} - \mathbf{e}\mathbf{d}_S^T, \]

where $\mathbf{d}_S = \mathrm{Diag}(\mathbf{S})\mathbf{e}$ is the vector of diagonal elements of $\mathbf{S}$. Finally, the matrix of RSP
dissimilarities $\boldsymbol{\Delta}^{\mathrm{RSP}}$ is defined by symmetrizing $\bar{\mathbf{C}}$: $\boldsymbol{\Delta}^{\mathrm{RSP}} = (\bar{\mathbf{C}} + \bar{\mathbf{C}}^T)/2$. The whole procedure
of computing $\boldsymbol{\Delta}^{\mathrm{RSP}}$ is presented in Algorithm 1. The advantage compared to the algorithm
presented in [39] is that the matrix of dissimilarities can be computed in closed form by matrix
multiplication instead of a loop.

4.3 A new generalized distance based on Helmholtz free energy

As already mentioned, one of the drawbacks of the RSP dissimilarity is that it is not a
metric as it does not satisfy the triangle inequality for intermediate values of $\theta$. To overcome
this problem we derive a new distance measure called the free energy distance, which is based
on the same idea as the RSP dissimilarity. We conclude that the proposed free energy
distance is actually the same as the potential distance defined recently in [18]. However, in that
reference, the derivation of the potential distance is left rather unmotivated. The derivation
provided here gives a sounder theoretical background to the distance measure and thus
we suggest calling the distance the free energy distance instead of the potential distance. Also,
it is worth mentioning that in [18], another distance measure is defined based on the bag-of-paths
framework, called the surprisal distance. We have also run experiments with the surprisal
distance, but finally decided to leave it out of the current presentation, because it does not
generalize the CT distance, unlike the other distance measures under study. Nevertheless, we
noticed that the surprisal distance also performs well in the clustering tasks presented with
the other distance measures in Section 6.

The free energy has already been used in various contexts in network theory. In [12], the
authors define a ranking method called the free-energy rank (in the spirit of the well-known
PageRank [28]) by computing the transition probabilities minimizing the free energy rate
encountered by a random walker. Then, the stationary distribution of the defined Markov chain
is the free-energy rank score. In [3], the authors compute edge flows minimizing the free
energy between two nodes. The resulting flows define some new edge and node betweenness
measures, balancing exploration and exploitation through an adjustable temperature parameter.
Their model is quite close to the RSP framework and was developed in parallel with our article.
However, the authors do not define a distance measure based on the free energy.

We now derive the free energy distance and then show that it coincides with the potential
distance. Recall that the RSP dissimilarity was defined by considering a distribution of random
walks between two nodes that minimizes the expected cost $\bar{c}(\tilde{P}_{st})$ subject to a relative entropy
constraint. Now, instead of the expected cost $\bar{c}(\tilde{P}_{st})$, let us consider a random walker choosing
a path from node $s$ to node $t$ according to the distribution that minimizes the quantity

\[ \phi(\tilde{P}_{st}) = \bar{c}(\tilde{P}_{st}) + J(\tilde{P}_{st} \| \tilde{P}^{\mathrm{ref}}_{st}) / \theta, \tag{8} \]

where $J(\tilde{P}_{st} \| \tilde{P}^{\mathrm{ref}}_{st})$ is the relative entropy of the distribution $\tilde{P}_{st}$ with respect to $\tilde{P}^{\mathrm{ref}}_{st}$. The quantity
$\phi$ is known in statistical physics as the Helmholtz free energy [29] of a thermodynamical system
with temperature $T = 1/\theta$ and state transition probabilities $\tilde{P}_{st}$.⁵

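The minimization of free energy can be simply written as

\[ \tilde{P}^{\mathrm{FE}}_{st} = \arg\min_{\tilde{P}_{st}} \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp)\, \tilde{c}(\wp)
 + \frac{1}{\theta} \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp) \log\big( \tilde{P}_{st}(\wp) / \tilde{P}^{\mathrm{ref}}_{st}(\wp) \big)
 \quad \text{s.t.} \quad \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}_{st}(\wp) = 1. \]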

It is not difficult to see that this problem becomes equivalent to the minimization problem
involved in the definition of the RSP probabilities and thus the optimal solution is the Boltzmann
distribution (2), in other words, $\tilde{P}^{\mathrm{FE}}_{st} = \tilde{P}^{\mathrm{RSP}}_{st}$. We define the free energy distance between
nodes $s$ and $t$ as the symmetrized minimum free energy between these two nodes, in other
words

\[ \Delta^{\mathrm{FE}}_{st} = \big( \phi(\tilde{P}^{\mathrm{FE}}_{st}) + \phi(\tilde{P}^{\mathrm{FE}}_{ts}) \big) / 2. \]

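⁵ Conventionally, the Helmholtz free energy is defined with the entropy of $\tilde{P}_{st}$ in place of the relative entropy.
Regardless of this, we use the term as we have presented it.

In order to show that the free energy distance coincides with the potential distance defined
in [18], we recall that the RSP probability can be written as (see Equations (2) and (3))

\[ \tilde{P}^{\mathrm{RSP}}_{st}(\wp) = \frac{\tilde{P}^{\mathrm{ref}}_{st}(\wp) \exp(-\theta \tilde{c}(\wp))}{z^h_{st}}. \]

Using then the fact that $\sum_{\wp \in \mathcal{P}_{st}} \tilde{P}^{\mathrm{RSP}}_{st}(\wp) = 1$, we can write out the expression for the relative
entropy:

\begin{align*}
J(\tilde{P}^{\mathrm{RSP}}_{st} \| \tilde{P}^{\mathrm{ref}}_{st})
 &= \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}^{\mathrm{RSP}}_{st}(\wp) \log \frac{\tilde{P}^{\mathrm{RSP}}_{st}(\wp)}{\tilde{P}^{\mathrm{ref}}_{st}(\wp)} \\
 &= \sum_{\wp \in \mathcal{P}_{st}} \tilde{P}^{\mathrm{RSP}}_{st}(\wp) \big( -\theta \tilde{c}(\wp) - \log z^h_{st} \big) \\
 &= -\theta\, \bar{c}(\tilde{P}^{\mathrm{RSP}}_{st}) - \log z^h_{st}.
\end{align*}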
When combining this result with Equation (8), the free energy becomes

\[ \phi(\tilde{P}^{\mathrm{FE}}_{st}) = -\frac{1}{\theta} \log z^h_{st}, \]

which after symmetrization equals the potential distance defined in [18] and can be computed
easily thanks to Equation (5). Thus, we have shown that the potential distance derived within
the bag-of-paths framework can in fact be derived from the RSP framework by considering
the minimum Helmholtz free energy as the distance, instead of the minimum expected cost.
We also note that the quantity $\log z^h_{st}$ already appeared in [19] as a potential inducing a drift
for a random walker in a continuous-state extension of the RSP framework.

Of course, the free energy distance also satisfies all the properties that were proved for the
potential distance in [18]. Most importantly, it was shown that the distance obeys the triangle
inequality, as opposed to the RSP dissimilarity. The distance also converges to the SP distance
when $\theta \to \infty$ and to the CT distance when $\theta \to 0^+$. In addition, it is shown to be graph-geodetic,
meaning that $\Delta^{\mathrm{FE}}_{st} = \Delta^{\mathrm{FE}}_{sk} + \Delta^{\mathrm{FE}}_{kt}$ if and only if all paths from node $s$ to node $t$
go through node $k$. This shows that the minimum free energy between two nodes defines a
meaningful distance measure between graph nodes with nice properties.



5 Other generalized graph distances 

There have been a few other suggestions for graph distances that generalize the resistance or
CT and the SP distances. Alamgir and von Luxburg defined a generalized distance called the
p-resistance distance in order to tackle the problem of the resistance distance becoming meaningless
with large graphs [2]. Indeed they show that with certain values of the parameter
$p$, the p-resistance distance avoids this pitfall. In addition, Chebotarev has defined several
parametrized graph distance measures [10]. In this paper, we focus on the logarithmic
forest distances [7]. In addition to these two distances, we also experiment with a simple
generalized distance that only takes a weighted average of the resistance and SP distances.

Algorithm 2 Computation of the matrix of all pairwise free energy distances of $G$.

Require:
- A graph $G$ containing $n$ nodes.
- The $n \times n$ adjacency matrix $\mathbf{A}$ associated to $G = (V, E)$, containing affinities.
- The $n \times n$ cost matrix $\mathbf{C}$ associated to $G$.
- The inverse temperature parameter $\theta$.

Ensure:
- The $n \times n$ matrix $\boldsymbol{\Delta}^{\mathrm{FE}}$ containing the free energy distances $\Delta^{\mathrm{FE}}_{ij}$ for all $i, j \in V$.

1. $\mathbf{D} \leftarrow \mathrm{Diag}(\mathbf{A}\mathbf{e})$ {the row-normalization matrix}
2. $\mathbf{P}^{\mathrm{ref}} \leftarrow \mathbf{D}^{-1}\mathbf{A}$ {the reference transition probabilities matrix}
3. $\mathbf{W} \leftarrow \mathbf{P}^{\mathrm{ref}} \circ \exp[-\theta \mathbf{C}]$ {elementwise exponential and multiplication $\circ$}
4. $\mathbf{Z} \leftarrow (\mathbf{I} - \mathbf{W})^{-1}$ {the fundamental matrix}
5. $\mathbf{Z}_h \leftarrow \mathbf{Z}\mathbf{D}_h^{-1}$, with $\mathbf{D}_h = \mathrm{Diag}(\mathbf{Z})$ {the fundamental matrix of hitting paths}
6. $\boldsymbol{\Phi} \leftarrow -\log(\mathbf{Z}_h)/\theta$ {the negative elementwise logarithm, scaled by $1/\theta$}
7. $\boldsymbol{\Delta}^{\mathrm{FE}} \leftarrow (\boldsymbol{\Phi} + \boldsymbol{\Phi}^T)/2$ {the distance matrix by symmetrization}
8. return $\boldsymbol{\Delta}^{\mathrm{FE}}$
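
Algorithm 2 can be sketched in a few lines of NumPy. As before, this is an illustrative sketch rather than the authors' code: the function name `free_energy_distance`, the example graph (a triangle with one pendant node), unit costs and `theta = 2.0` are assumptions, and the graph is assumed connected. The example also evaluates the triangle inequality over all node triples and the additivity along the cut node 2, which are the properties stated for the distance in Section 4.3.

```python
import numpy as np

def free_energy_distance(A, C, theta):
    """Sketch of Algorithm 2; assumes a connected graph."""
    n = A.shape[0]
    P_ref = A / A.sum(axis=1, keepdims=True)   # steps 1-2: D^{-1} A
    W = P_ref * np.exp(-theta * C)             # step 3
    Z = np.linalg.inv(np.eye(n) - W)           # step 4
    Z_h = Z / np.diag(Z)[None, :]              # step 5: Z D_h^{-1}, z_st / z_tt
    Phi = -np.log(Z_h) / theta                 # step 6: directed free energies
    return (Phi + Phi.T) / 2.0                 # step 7: symmetrization

# Hypothetical graph: triangle 0-1-2 with a pendant node 3 attached to node 2,
# so that every path between node 3 and nodes 0 or 1 goes through node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
C = A.copy()   # unit costs on the edges

FE = free_energy_distance(A, C, theta=2.0)

# tri[s, k, t] = FE[s, k] + FE[k, t] - FE[s, t]; nonnegative under the triangle inequality.
tri = FE[:, :, None] + FE[None, :, :] - FE[:, None, :]
additivity_gap = tri[0, 2, 3]   # expected to vanish because node 2 separates 0 and 3
```

Compared with the sketch of Algorithm 1, only the last steps differ: the logarithm of $\mathbf{Z}_h$ replaces the computation of the matrix $\mathbf{S}$.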



5.1 p-resistance distance 

Alamgir and von Luxburg [2] define a generalization of the resistance distance, called the
p-resistance distance. Like the resistance distance, the p-resistance distance considers the graph as
an electrical resistance network, where the edges $(k, l) \in E$ of the network have resistances $r_{kl}$
(similar to costs) and a unit volt battery is attached to the nodes whose distance is being
measured. This forms a unit flow from $s$ to $t$, $(i_{kl})_{s \to t}$, where the currents $i_{kl}$ are assigned on all
the edges $(k, l) \in E$ of the graph. In short, this means that for all $k, l$ such that $(k, l) \in E$ the
currents $i_{kl}$ satisfy the following three conditions: (1) $i_{kl} = -i_{lk}$, (2) $\sum_l i_{sl} = 1$ and
$\sum_k i_{kt} = 1$, and (3) $\sum_l i_{kl} = 0$ for $s \ne k \ne t$. Then for a constant $p > 0$, the p-resistance distance is
defined as the minimized p-resistance (w.r.t. the unit flow) between $s$ and $t$, formally as

\[ \Delta^{p\mathrm{Res}}_{st} = \min_{(i_{kl})_{s \to t}} \sum_{(k, l) \in E} r_{kl}\, |i_{kl}|^p. \tag{9} \]

When the parameter p = 2, the above definition becomes the definition of effective resistance,
i.e. the resistance distance, and when p = 1 the distance coincides with the SP distance.
Alamgir and von Luxburg [2] show that there exists a value 1 < p < 2 for which the p-resistance
distance avoids the problem of the traditional resistance distance with large graphs. In a closely
related work [24], the authors also study network flow optimization in the same spirit as with the
p-resistance. Their viewpoint is based on network routing problems and provides a spectrum of
routing options that make a compromise between latency and energy dissipation in selecting
routes in a network. They do not, however, explicitly define a graph node distance.

The p-resistance distance is theoretically sound, but it lacks a closed form expression for
computing all the pairwise distances of a graph. Thus, the result can only be obtained by solving
the minimization (9) for each pair of nodes separately. This currently limits the method to
small graphs.
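
As a small worked check of the two special cases, consider a hypothetical three-node graph with unit resistances on the edges (s,t), (s,u) and (u,t), and assume that the p-resistance is the minimum over unit flows of $\sum_{(k,l)} r_{kl} |i_{kl}|^p$. A unit flow from s to t sends an amount x along the direct edge and 1 - x along the two-edge path through u, so that the objective reduces to a function of x alone:

```latex
\begin{align*}
f_p(x) &= |x|^p + 2\,|1 - x|^p, \\
p = 1: \quad & f_1(x) = 2 - x \ \text{on } [0,1], \ \text{minimized at } x = 1,
  \ \text{so } f_1(1) = 1 \ \text{(the SP distance)}, \\
p = 2: \quad & f_2'(x) = 2x - 4(1 - x) = 0 \ \Rightarrow\ x = \tfrac{2}{3},
  \ \text{so } f_2(\tfrac{2}{3}) = \tfrac{4}{9} + \tfrac{2}{9} = \tfrac{2}{3}.
\end{align*}
```

The value 2/3 agrees with the effective resistance of a unit resistor in parallel with two unit resistors in series, $(1 \cdot 2)/(1 + 2) = 2/3$.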

5.2 Logarithmic forest distances 

The logarithmic forest distance has its foundation in the matrix-forest theorem and in another
family of distances developed earlier by Chebotarev, called simply the forest distance [9], [10].

The definition of the logarithmic forest distance goes as follows. First, we define the Laplacian
(or Kirchhoff) matrix of a graph G as L = D - A, where D = Diag(Ae). Then we consider
the matrix

$$Q = (I + \alpha L)^{-1},$$

where $\alpha > 0$. The elements of this matrix measure the relative forest accessibilities [9], which
can be considered as similarities between nodes of the graph after all its edge weights have
been multiplied by the constant $\alpha$. In fact, in [7] Chebotarev handles a more general case by
considering arbitrary transformations of the edge weights and multigraphs instead of graphs.
The definition proceeds by taking the elementwise logarithmic transformation

$$M = \gamma (\alpha - 1) \log_{\alpha} Q,$$

where $\gamma > 0$ is another parameter and the logarithm is taken elementwise in base $\alpha$. This
expression provides another similarity measure. From it, the matrix of logarithmic forest
distances is derived as

$$\Delta^{\mathrm{logFor}} = \tfrac{1}{2}(m e^T + e m^T) - M,$$

where m = diag(M). The last transition is a classical way of defining a matrix of distances
from a matrix of similarities [4].

The above definition provides a metric which also satisfies the geodesic property (see
Section 3). For any positive value of the parameter $\gamma$, the logarithmic forest distance becomes
proportional to the CT and the SP distances as $\alpha \to \infty$ and $\alpha \to 0^+$, respectively.⁶ In the
special case of $\gamma = \log(e + \alpha^{2/n})$, Chebotarev shows that the logarithmic forest distance
approaches exactly these two other distances. However, this form is not very practical, because
even with moderate size graphs the exponent 2/n cancels out the effect of setting a large value
to $\alpha$. Thus, we decided simply to assign $\gamma = 1$ in our experiments.
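
The three formulas of this section can be chained directly. The sketch below is an illustration under stated assumptions rather than Chebotarev's own procedure: the function name `log_forest_distances` is hypothetical, the graph is assumed connected (so that all entries of Q are positive), and the case $\alpha = 1$ is handled through the limit $(\alpha - 1)/\ln \alpha \to 1$.

```python
import numpy as np

def log_forest_distances(A, alpha, gamma=1.0):
    """Logarithmic forest distances from Q = (I + alpha * L)^{-1}."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A                  # Laplacian L = D - A
    Q = np.linalg.inv(np.eye(n) + alpha * L)        # relative forest accessibilities
    # (alpha - 1) * log_alpha(q) = (alpha - 1) / ln(alpha) * ln(q);
    # the prefactor tends to 1 as alpha -> 1.
    scale = 1.0 if np.isclose(alpha, 1.0) else (alpha - 1.0) / np.log(alpha)
    M = gamma * scale * np.log(Q)                   # elementwise logarithm
    m = np.diag(M)
    return 0.5 * (m[:, None] + m[None, :]) - M      # (m e^T + e m^T)/2 - M

# Extended triangle graph, nodes indexed from 0 (node 0 is the pendant node).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D_log = log_forest_distances(A, alpha=1.0)
```

For this graph and $\alpha = \gamma = 1$, the distance between the pendant node and its neighbor is larger than the distance between two triangle nodes, in line with the behavior reported for this family in the experiments.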



5.3 Weighted average between SP and CT distances 

The graph distance families presented above all involve a sophisticated theoretical derivation.
In the experiments we want to compare these distances also to a baseline model that
generalizes the SP and CT distances, namely the weighted average of the two distances:

$$\Delta^{\mathrm{SP\text{-}CT}} = \lambda \Delta^{\mathrm{CT}} + (1 - \lambda) \Delta^{\mathrm{SP}},$$

where $\lambda \in [0, 1]$. We call it straightforwardly the SP-CT combination distance; it is a distance
because a convex combination of metrics is also a metric. Although the distance does not
contain as interesting details as the other distances, it at least appears competitive with the
more intricate distance measures in the clustering experiments presented below.
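
The metric property of a convex combination follows in one line. For two metrics $d_1$ and $d_2$ (generic symbols introduced here for the argument) and a weight $\lambda \in [0, 1]$, the triangle inequality carries over termwise:

```latex
\begin{align*}
d_\lambda(i,k) &= \lambda\, d_1(i,k) + (1 - \lambda)\, d_2(i,k) \\
  &\le \lambda\,[\,d_1(i,j) + d_1(j,k)\,] + (1 - \lambda)\,[\,d_2(i,j) + d_2(j,k)\,] \\
  &= d_\lambda(i,j) + d_\lambda(j,k).
\end{align*}
```

Symmetry and nonnegativity are inherited directly, and $d_\lambda(i,j) = 0$ forces $i = j$ because at least one of the two weights is strictly positive.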



6 Experiments 

In this section, we compare the different distance families presented in the previous sections,
namely the RSP dissimilarity, the free energy distance, the p-resistance distance, the
logarithmic forest distance and the SP-CT combination distance. First, we consider small artificial
graphs and study the behavior of the different distances with different parameter values. This
is done by seeing how the relation between distances of different pairs of nodes changes as the
parameter value is altered. As mentioned by Chebotarev [8], the interest in comparing different
distance measures does not lie in the pairwise distances themselves, but in the proportions
between the pairwise distances. We then use the distance families for clustering small real
world networks and compare the clusterings obtained with different distances and parameter
values. Finally, we run a series of systematic clustering experiments with larger networks in
order to compare the quantitative performance of the different families of distances.

6 More accurately, the logarithmic forest distance converges to the unweighted SP distance (see Section 3) but we
nevertheless include it in our comparison.

Fig. 1: The extended triangle graph (a) and the ratio of distances $\Delta_{12}/\Delta_{23}$ (b) with the RSP
dissimilarity (RSP), the free energy distance (FE), the p-resistance distance (pRes), the
logarithmic forest distance (logFor) and the SP-CT combination distance (SP-CT) for
the different ranges of parameter values.



6.1 Comparisons with small graphs 



In the first example, we use the simple graph depicted in Figure 1(a) consisting of a triangle,
i.e. a 3-clique, connected to an isolated node. We call it the extended triangle graph. We observe
the proportions of distances between nodes 1 and 2 and nodes 2 and 3, i.e. the quantities
$\Delta_{12}/\Delta_{23}$, for all the different distance families. We plot the results in Figure 1(b) using 20
different parameter values for each family of distances. The parameter values are scaled in
such a way that the relevant parameter range of each distance family becomes visible. In
addition, the abscissa is logarithmic for all other parameters but linear for the $\lambda$ of the SP-CT
combination distance.

First of all we can observe that all curves converge to unity on the right hand end of the plot.
This happens as all the distances converge to the shortest path distance and thus $\Delta_{12} = \Delta_{23} = 1$
for all distances. On the left end of the plot, all curves approach the value 1.5, which is the
ratio of the CT distances between the nodes. In other words, for the CT distance
$\Delta^{\mathrm{CT}}_{12} > \Delta^{\mathrm{CT}}_{23}$ holds, which is caused by the fact that the nodes 2 and 3 are, in a sense,
better connected together (namely through node 4) than nodes 1 and 2.
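
The two limiting ratios can be reproduced numerically. The sketch below computes the SP distances with a Floyd-Warshall recursion and the CT distances from the pseudoinverse of the Laplacian; the function names and the 0-based node indexing (node 0 is the pendant node, i.e. node 1 of the figure) are choices made for this illustration only.

```python
import numpy as np

def shortest_path_distances(C):
    """Floyd-Warshall on a cost matrix; a zero off-diagonal entry means 'no edge'."""
    C = np.asarray(C, dtype=float)
    D = np.where(C > 0, C, np.inf)
    np.fill_diagonal(D, 0.0)
    for k in range(C.shape[0]):
        # Allow paths that pass through node k.
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def commute_time_distances(A):
    """Commute times: vol(G) * (l+_ss + l+_tt - 2 l+_st), with L+ the pseudoinverse."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    return A.sum() * (d[:, None] + d[None, :] - 2.0 * Lp)

# Extended triangle graph with unit weights and unit costs.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
D_sp = shortest_path_distances(A)
D_ct = commute_time_distances(A)
ratio_sp = D_sp[0, 1] / D_sp[1, 2]   # 1.0
ratio_ct = D_ct[0, 1] / D_ct[1, 2]   # 1.5
```

The CT ratio of 1.5 reflects the effective resistances: 1 between the pendant node and its neighbor, and 2/3 between two triangle nodes (a unit edge in parallel with a two-edge path).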

The real interest in Figure 1(b) lies in the transformation that takes place in the intermediate
parameter values of the distance families. We can observe that the ratio $\Delta_{12}/\Delta_{23}$ changes
monotonically with respect to the parameter value in three cases: with the p-resistance distance,
the logarithmic forest distance and, obviously, the SP-CT combination. In other words, these
three metrics always consider the distance between nodes 2 and 3 smaller than the distance
between node 2 and the isolated node 1.

Fig. 2: A graph with a 4-clique and a 3-clique and a hub node between them.

However, with the free energy distance and the RSP dissimilarity, the ratio behaves
non-monotonically. In other words, for a range of intermediate parameter values, these functions
consider the distance between the isolated node 1 and the central node 2 to be smaller than the
distance between nodes 2 and 3 (and between 2 and 4). Allowing this possibility could prove
useful for a distance measure in applications. For example, in a social network a relationship
with an isolated person can in some situations and contexts be considered stronger than the
relationship with a member of a group.

The phase transition that occurs with the free energy distance and the RSP dissimilarity in
our small example case can have implications in more practical situations as well. Obviously,
it can affect nearest neighbor related methods but also clustering applications. Consider, for
example, a larger scale situation, such as the graph depicted in Figure 2. This graph consists of two
cliques of sizes 4 and 3 which are connected through a hub node (node 5) that has edges to all
the other nodes of the graph. Consider then a clustering of the graph nodes into two clusters.
The nodes in the two cliques obviously should belong to their own clusters. But which cluster
should the hub node 5 be assigned to? This is generally a question of context and taste. In
some cases there might be a preference for classifying the hub node to the smaller cluster,
whereas in others it should be considered part of the larger cluster. One option would also be
to put the hub node into its own cluster. However, here we are interested in cases where the
number of clusters is fixed and a decision on the cluster assignment of the hub node has to be
made.

In this specific case, the p-resistance distance, the logarithmic forest distance and the SP-CT
combination distance always consider node 5 closer to the larger clique than the small one.
Thus, for example, when performing a k-means based clustering with k = 2, using the
mentioned distances will always result in assigning the hub node 5 into the larger cluster.
However, the other two distances are more flexible. Namely, thanks to the phase transition seen
in Figure 1(b), performing k-means with these distances can result in two different partitions
depending on the parameter value. It is worth pointing out that since the shortest path distance
between node 5 and all other nodes is 1, a k-means clustering can result in either of the two
interesting partitions, because with both partitions the global minimum within-cluster inertia
is achieved.

This observation might give some insight into the question of how to select the parameter
value of a generalized distance measure in a specific task. So far, in applications, the parameter
has been tuned with external training data, as for example with the RSP dissimilarity in [39].
An ideal solution would be to find a way to determine an optimal value in an unsupervised
fashion only by looking at the structure of the graph. However, with graphs like the one in
Figure 2, the quality of a partition can depend also on the context of the data instead of only
the structure of the graph. Perhaps it is possible to infer the appropriate parameter value if
the nature of the data (e.g. the type of the relations in a social network) is known, but even this
seems quite idealistic. Thus, a supervised tuning procedure still seems like the best approach
for deciding on the parameter value when using the parametrized distances in applications.

In any case, our examples with the small graphs above illustrate that there are subtle
differences between the generalized distance families. These differences may be useful for deciding
which distance measure should be used in which case. In Sections 6.2 and 6.3 we test the
different distance measures in clustering tasks. In Section 6.2 the results obtained with the
different distance families are quite similar. This indicates that the phenomenon observed
with the small artificial graphs in this section does not seem to have a big influence when
dealing with real world network data, at least with the methodology we use, namely the kernel
k-means algorithm, and the data sets we investigated. However, in Section 6.3 we see some
differences in the capabilities of the different distance families in detecting desired clusters in
data. In the future we will extend this investigation to other methods such as semi-supervised
classification and link prediction.

6.2 Clustering nodes of networks 

Next we will employ the different graph node distances in a graph node clustering task with
real world network data sets. The main conclusion is that at least with small graphs the
different families provide quite similar results and that the differences observed in the examples
in Section 6.1 cannot be detected with these experiments. However, in a more systematic
comparison of the distance families, with larger data sets, we observe some differences in the
results.

For clustering, we employ the kernel k-means algorithm introduced in [38]. It is based on
searching for prototype vectors in a sample space by a k-means type iteration. Similarities in the
sample space induce distances between the data and the prototype vectors in an embedding
space by application of the kernel trick [32]. The goal of the algorithm is to find prototype
vectors in the sample space that minimize the within-cluster inertia, i.e. the sum of the squared
distances of each data point to its corresponding prototype in the embedding space. The
prototype vectors are initially set by randomly selecting k data points in the sample space. In
order to use the kernel k-means algorithm we need to convert each matrix of distances into a
matrix of similarities. We transform the distance matrices $\Delta$ into similarity matrices K in a
classical way [4] as

$$K = -\tfrac{1}{2} H \Delta H, \tag{10}$$

where $H = I - ee^T/n$ is the centering matrix. These similarity matrices are not necessarily
kernels in the traditional sense as they might not be positive definite. Positive
semidefiniteness could be ensured by setting the negative eigenvalues of the similarity matrix to zero.
However, we have noticed in experiments that this has little effect on the results or on the
convergence of the kernel k-means.
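
To make the pipeline concrete, the following sketch chains the centering transformation of a distance matrix with a clustering step. It is an illustration under assumptions, not the algorithm of [38]: `kernel_kmeans` is a generic Lloyd-style kernel k-means with random label initialization (rather than prototype vectors in the sample space), and all function names, the restart count and the toy data are hypothetical. The toy input holds squared Euclidean distances, for which the centering formula is known to return the centered Gram matrix.

```python
import numpy as np

def distances_to_similarities(Delta, clip_negative=False):
    """K = -1/2 * H Delta H, with the centering matrix H = I - e e^T / n."""
    Delta = np.asarray(Delta, dtype=float)
    n = Delta.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * H @ Delta @ H
    if clip_negative:
        # Optional: set negative eigenvalues to zero (positive semidefinite K).
        w, V = np.linalg.eigh((K + K.T) / 2.0)
        K = (V * np.clip(w, 0.0, None)) @ V.T
    return K

def _squared_distances(K, labels, k):
    """Squared embedding-space distances of every point to every cluster mean."""
    n = K.shape[0]
    dist = np.full((n, k), np.inf)   # empty clusters stay at infinity
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if idx.size > 0:
            dist[:, c] = (np.diag(K) - 2.0 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
    return dist

def kernel_kmeans(K, k, n_restarts=20, n_iter=100, seed=0):
    """Generic kernel k-means; keeps the restart with the smallest inertia."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    best_labels, best_inertia = None, np.inf
    for _ in range(n_restarts):
        labels = rng.integers(k, size=n)          # random initial assignment
        for _ in range(n_iter):
            new_labels = _squared_distances(K, labels, k).argmin(axis=1)
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        inertia = _squared_distances(K, labels, k)[np.arange(n), labels].sum()
        if inertia < best_inertia:
            best_labels, best_inertia = labels.copy(), inertia
    return best_labels, best_inertia

# Toy data: two well-separated groups of three points in the plane.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
Delta = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = distances_to_similarities(Delta)
labels, inertia = kernel_kmeans(K, k=2)
```

In the experiments of this paper, `Delta` would be replaced by one of the graph distance matrices, in which case `K` need not be positive semidefinite and the optional clipping becomes relevant.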

In addition to the similarity matrices derived from the generalized graph distance matrices
through (10), we also use the sigmoid commute time kernel proposed by Yen et al. [38]. They
construct the kernel by taking a sigmoid transformation of the elements of the commute time
kernel, which can be computed as the Moore-Penrose pseudoinverse $L^+$ of the graph Laplacian.
Thus, the similarities given by this method are

$$[K_{\mathrm{SigCT}}]_{ij} = \frac{1}{1 + \exp(-a \, l^+_{ij} / \sigma)}.$$

The parameter a controls the smoothing of the similarity values caused by the sigmoid
transformation and $\sigma$ is the standard deviation of the elements $l^+_{ij}$. The sigmoid commute time
kernel has been shown to perform well in many machine learning tasks, especially in the
kernel k-means method used in this paper. We consider it as a baseline for the clustering
performance comparisons. Note, however, that it does not provide any clear generalization
of either the SP or the CT distance (or similarities related to them), which explains its different
behavior in the plots in comparison to the curves obtained with other methods.
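
A short sketch of this baseline follows. It assumes the logistic form $1/(1 + \exp(-a\, l^+_{ij}/\sigma))$ for the sigmoid transformation, with $\sigma$ taken as the standard deviation over all elements of $L^+$; the function name and the example graph are hypothetical, and the exact normalization used in [38] should be checked against that reference.

```python
import numpy as np

def sigmoid_commute_time_kernel(A, a):
    """Elementwise sigmoid of the pseudoinverse of the graph Laplacian."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)          # commute time kernel L+
    sigma = Lp.std()                # standard deviation of the elements of L+
    return 1.0 / (1.0 + np.exp(-a * Lp / sigma))

# Extended triangle graph, nodes indexed from 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
K_sig = sigmoid_commute_time_kernel(A, a=3.0)
```

Because the transformation is monotone, the ordering of the similarities in $L^+$ is preserved, while the parameter a controls how sharply they are pushed towards 0 and 1.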

6.2.1 Zachary karate club network 

The first network for experimenting with the distance families is the famous Zachary karate
club data set. It is a network of social interactions between members of an American
university karate club [40]. During the collection of the data, the club split into two separate clubs
because of a disagreement between two members of the club. These groups can be detected
with most clustering and community detection algorithms from the network structure.

After running experiments with different settings we came to the conclusion that the
clustering results are very similar with all the different families of distances. In fact, when the
clustering is run sufficiently many times (at least a few hundred times), the results with all
distance families are practically the same. With each distance we obtain three different
clusterings depending on the parameter values. With parameter values that cause the distances to
become close to the SP distance, the kernel k-means clusters the nodes correctly into the two
relevant groups. However, when the parameter values are changed towards the other end
of the parameter range, there are two phase transitions with each distance measure. First, at
some parameter value the clustering misclassifies node 3 in the network. It has been noticed
earlier (e.g. in [16]) that node 3 is a central node in the network and is thus often misclassified
by clustering algorithms.

As the parameters are changed even further, the distances approach the CT (or resistance)
distance, which causes the algorithm to fail quite badly. It seems that with the CT distance
the minimum within-cluster inertia is achieved with a partition in which one cluster
consists only of nodes 5, 6, 7, 11, 12 and 17. Although this is far from the correct classification,
it can be explained by studying the structure of the graph. Indeed, all of the nodes in the small
cluster, except for node 12, do form a clear small community. Node 12 has only one edge,
which connects to a hub node, node 1, which is also well connected to the nodes forming the
rest of the clusters.

6.2.2 The Political books network 

We also use the Political books data set [25] gathered by Valdis Krebs.⁷ The nodes in this
unweighted network are books that have been labeled according to their political orientation
either as conservative, liberal or neutral. The edges of the network represent frequent
purchases of two books by the same people from the Amazon online bookstore. Thus, the
classification based on the political theme of the books should, at least to some extent, be observable
from the network structure.

7 http://www.orgnet.com/divided.html

Fig. 3: The mean NMI scores obtained with clusterings of the Political books network using
the different distances and their different parameter values.

In this experiment, we run the kernel k-means clustering on the Political books network
20 times with different random initializations. Out of these 20 partitions we pick the one that
results in the smallest within-cluster inertia. We compute the normalized mutual information
(NMI) between this optimal clustering (according to the inertia) and the true labeling of the
books. This process is repeated another 20 times and the mean and standard deviation
of the NMI scores of these clusterings are collected.
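
For reference, the NMI between two labelings can be computed from the label counts alone. The sketch below normalizes the mutual information by the geometric mean of the two entropies; this normalization is an assumption, since several variants of NMI exist and the text does not state which one is used. The function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural logarithm) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_information(labels_a, labels_b):
    """NMI = I(A; B) / sqrt(H(A) * H(B)); geometric-mean normalization (assumed)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)

nmi_same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])    # relabeled copy
nmi_indep = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])   # independent
```

The score is invariant to a renaming of the cluster labels, which is why it suits the comparison of a clustering with a class labeling.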

The results are depicted in Figure 3, where, for the sake of clarity, we only show the mean
values of the NMI scores. We can see that also with this network the clustering performances are
quite similar for all distance measures. All distance measures fail to separate the set of
neutral books as its own clear cluster and the NMI scores vary between 0.53 and 0.61. However,
when we examined the cluster assignments of individual nodes, we noticed more differences
between the different distance families. The kernel k-means misclassifies different nodes with
different distance measures. The nodes that are misclassified also change with different
parameter values of the distance measures. With all the distance measures the clustering results
are a bit better towards the left end of the plot, in other words when the distances are closer
to the CT or resistance distances.



6.2.3 The Football network 

In conjunction with the experiment with the Political books network, we also tried
clustering another well known data set, the American college Football network data set [26]. The
unweighted network consists of 115 American football teams, two of which are connected by an
edge if they played a game with each other during the 2000 regular season. The teams can be
divided into 12 conferences which should also be detectable from the structure of the network.

We performed the same clustering experiment with the same settings as with the Political
books data. The resulting mean NMI scores of each distance measure throughout the
parameter ranges are drawn in Figure 4. This time, the sigmoid CT kernel can be distinguished as
achieving overall the best clustering results with parameter values in the range $a \in [3, 12]$.
Otherwise, it is difficult to find big differences in the results obtained with the different
distance families. There is, however, a noticeable difference with the results obtained with the
Political books data set in the previous Section. Namely, this time the performance of the
clustering improves on the right-hand side of the plot, in other words with parameter values that
make the distances closer to the SP distance.

Fig. 4: The mean NMI scores obtained with clusterings of the Football network using the different
distances and their different parameter values.

6.3 A systematic clustering performance comparison 

So far, we have investigated the behavior of the different graph distance families in a detailed
manner. Now we want to employ the distance families in a larger clustering task in order to
compare their performances quantitatively. We use a collection of text document networks
extracted from the 20 Newsgroups data set.⁸ A more detailed description of the collection
that we use can be found in Table 1 and in [38]. In short, our collection consists of ten
different weighted undirected networks, where the nodes represent text documents and edges
and their weights are formed according to the co-occurrence of words within the documents.
Each network has been constructed by combining subsets of 200 documents, each from one topic.
The networks consist of either two, three or five such subsets of documents, resulting in
networks of 400, 600 and 1000 nodes. The goal of the clustering is then to detect the division
of each network into these subsets.

We could not obtain results in this experiment with the p-resistance distance because of its high
computational cost, discussed already in Section 5.1, and the sizes of the networks in the
experiment. In addition, we want to see how well the method can generalize from one data set of
a particular kind (here, a text document collection) to another in order to avoid running the
experiment for a wide range of parameter values. To achieve this, we fix the parameters of
each distance family and the sigmoid commute time kernel by using one of the ten networks
as a tuning data set. We again perform the same repetitive clustering procedure with the
kernel k-means as in the previous Section with the Political books graph and the Football graph.
As a result we get, for each distance measure and each measured parameter value, a sample
of 20 NMI scores corresponding to partitions with a small within-cluster inertia. This time,
however, we fix the parameter value for each family of distances to the value providing the
highest mean NMI score with the tuning data set. The tunings were performed
for 20 different parameter values, distributed logarithmically (or linearly, in the case
of the SP-CT combination distance) on a given range of values. The ranges of parameter values and the
optimal parameter values for each method are reported in Table 2. Note that with the SP-CT
combination distance, the optimal parameter value is $\lambda = 1$. This means that already for the
600 node tuning network the best clustering results are obtained using only the shortest
path distance.

Network    Topic                    Size
G-2cl-A    Politics/general         200
           Sport/baseball           200
G-2cl-B    Computer/graphics        200
           Motor/motorcycles        200
G-2cl-C    Space/general            200
           Politics/mideast         200
G-3cl-A    Sport/baseball           200
           Space/general            200
           Politics/mideast         200
G-3cl-B    Computer/windows         200
           Motor/autos              200
           Religion/general         200
G-3cl-C    Sport/hockey             200
           Religion/atheism         200
           Medicine/general         200
G-5cl-A    Computer/windowsx        200
           Cryptography/general     200
           Politics/mideast         200
           Politics/guns            200
           Religion/christian       200
G-5cl-B    Computer/graphics        200
           Computer/pchardware      200
           Motor/autos              200
           Religion/atheism         200
           Politics/mideast         200
G-5cl-C    Computer/machardware     200
           Sport/hockey             200
           Medicine/general         200
           Religion/general         200
           Forsale/general          200

Tab. 1: The characteristics of the Newsgroups datasets used in the clustering experiments.
Nine subsets have been extracted from the full Newsgroup dataset, with documents
from either 2, 3 or 5 topics in one network. In addition, we use another network of 3
topics for parameter tuning. Each cluster is composed of 200 documents.



Distance                       Similarity matrix    Parameter range      Optimal value
RSP dissimilarity              K_RSP                [10^-4, 20]          θ = 0.02
Free energy distance           K_FE                 [10^-4, 100]         θ = 0.07
Logarithmic forest distance    K_LogF               [10^-2, 500]         α = 0.95
SP-CT combination              K_SP-CT              [0, 1]               λ = 1
Sigmoid commute time           K_SigCT              [10^-2, 10^3]        a = 26

Tab. 2: The notation of the similarity matrices and the optimal parameter values obtained on
the tuning data set.



We then use the tuned parameter values to perform the clustering on the nine remaining
networks. As in the tuning phase, we again perform the clustering with 20 different
initializations and choose the clustering that has the smallest within-cluster inertia. This is again done
another 20 times and the mean and standard deviation of the NMI scores of these 20 best
clusterings are collected. The results for each of the nine data sets are reported in Table 3. We
performed one-sided t-tests with significance level 0.05 to determine whether a result with
one method is significantly better than with another. The similarity matrices performing best
are presented in boldface for each data set.

From the results we see that the best scores are generally obtained with the free energy
distance, the randomized shortest path dissimilarity and the sigmoid commute time kernel.
Especially the two former ones seem to perform well quite consistently. The results with the
kernel obtained with the logarithmic forest distances are, on the other hand, a bit worse than
the others except with the network G-2cl-B, for which all the distances, excluding the SP-CT
combination, give quite similar results.

NMI        K_RSP          K_FE           K_LogF         K_SP-CT        K_SigCT
G-2cl-A    84.5 ± 0.00    80.7 ± 1.09    83.1 ± 1.47    65.2 ± 0.59    81.6 ± 0.00
G-2cl-B    58.7 ± 0.38    58.7 ± 1.74    58.8 ± 1.94    51.2 ± 0.46    56.8 ± 2.18
G-2cl-C    81.0 ± 0.00    81.1 ± 0.00    75.0 ± 1.13    85.9 ± 0.00    79.6 ± 0.00
G-3cl-A    76.6 ± 0.00    76.2 ± 0.00    75.4 ± 0.72    74.2 ± 0.28    77.3 ± 0.00
G-3cl-B    77.0 ± 0.00    78.3 ± 0.83    75.5 ± 1.42    62.6 ± 0.51    73.0 ± 0.00
G-3cl-C    76.5 ± 0.28    77.0 ± 0.50    74.4 ± 1.57    71.5 ± 0.50    75.9 ± 0.43
G-5cl-A    69.6 ± 0.15    69.0 ± 0.66    60.4 ± 3.43    68.1 ± 0.43    66.8 ± 0.16
G-5cl-B    64.0 ± 0.42    64.6 ± 0.34    58.7 ± 3.49    59.6 ± 0.59    60.4 ± 1.36
G-5cl-C    61.2 ± 0.71    61.6 ± 0.87    57.3 ± 2.77    47.8 ± 0.92    57.3 ± 0.46

Tab. 3: Clustering performances (Normalized Mutual Information) for each kernel on the nine
Newsgroup subsets.

Sim. matrix    Rank    Score
K_RSP          1        22
K_FE           2        18
K_LogF         4       -12
K_SP-CT        5       -23
K_SigCT        3        -5

Tab. 4: The ranking of the different similarity matrices according to Copeland's method based
on the results in clustering the Newsgroups data sets.

In order to rate the overall performances of the different similarity matrices in the clustering
task with the Newsgroups data sets, we use Copeland's ranking method [30], which simply
gives a score of +1 to a method that is significantly superior to another on a given data set, and
correspondingly a score of -1 to the other one. If there is no significant difference between two
methods, they both are assigned a score of 0. The ranking of the methods is then computed by
summing the scores over all pairwise comparisons of methods and over all data sets. The final
ranking according to Copeland's method is presented in Table 4. From there we see that the
similarity matrices based on the RSP dissimilarity and the free energy distance succeed best
in the task of finding clusters resembling the class labeling based on the topics of the text data.
The sigmoid commute time kernel gives intermediate results whereas the similarity matrices
based on logarithmic forest distances and the SP-CT distance perform more weakly.
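
The scoring rule can be written in a few lines. In the sketch below, the callback `compare`, the method names and the toy outcomes are all hypothetical; in the setting of this section, `compare` would wrap the one-sided t-test on the NMI samples of two methods for a given data set.

```python
from itertools import combinations

def copeland_scores(methods, datasets, compare):
    """Sum of pairwise outcomes over all method pairs and data sets.

    compare(m1, m2, dataset) returns +1 if m1 is significantly better than m2
    on the data set, -1 if it is significantly worse, and 0 otherwise.
    """
    scores = {m: 0 for m in methods}
    for d in datasets:
        for m1, m2 in combinations(methods, 2):
            outcome = compare(m1, m2, d)
            scores[m1] += outcome   # the winner gains a point
            scores[m2] -= outcome   # the loser gives one up
    return scores

# Toy outcomes as (winner, loser, dataset) triples; all other pairs are ties.
toy_wins = {("A", "B", "d1"), ("A", "C", "d1"), ("B", "C", "d2")}

def toy_compare(m1, m2, d):
    if (m1, m2, d) in toy_wins:
        return 1
    if (m2, m1, d) in toy_wins:
        return -1
    return 0

scores = copeland_scores(["A", "B", "C"], ["d1", "d2"], toy_compare)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Since every win is matched by a loss, the scores always sum to zero, a property that the scores reported for the five similarity matrices also satisfy.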



7 Conclusion 



In this article, we concentrated on graph node distances that generalize the SP and CT dis- 
tances. We first developed the theory behind one such distance, the RSP dissimilarity, by 
providing a new closed form algorithm for computing all pairwise dissimilarities of a graph. 
In addition, we derived the free energy distance based on the Helmholtz free energy. Al- 
though we show that the free energy distance coincides with the potential distance, proposed 
earlier, our new derivation provides a solid theoretical background to the distance. 

The other focus of the article was to compare different generalized graph node distances.
We gave simple examples of subtle differences between some of the distance families. When
used in clustering small real world networks, the different distances gave very similar results.
However, in the more systematic comparison of clustering performances on larger graphs,
we could see that there are differences in the results when using different distance measures.
In this comparison, the RSP dissimilarity and the free energy distance gave very good results. One
future plan is to use the different distance families in other machine learning and link analysis
tasks in order to characterize their differences further and give more insight into which distance
is appropriate in which context.



A In an undirected graph, the commute cost distance is proportional 
to the commute time distance 

For deriving this result, we refer to earlier literature. First, we call to mind a well-known
result [6] that the commute time distance can be computed in terms of the pseudo-inverse of
the graph Laplacian as

$$\Delta^{\mathrm{CT}}_{st} = (l^+_{ss} + l^+_{tt} - 2 l^+_{st}) \sum_{i,j=1}^{n} a_{ij}. \tag{11}$$

In addition, the authors in [17] derive a formula for computing the average first passage cost
from a node to another. This means the expected cost of paths that a random walker must take
in order to reach the terminal node from the starting node. We denote the average first passage
cost of going from node s to node t by $o_{st}$. The formula (see [17], Appendix B, Equation (18))
is given as

$$o_{st} = \sum_{i=1}^{n} (l^+_{si} - l^+_{st} - l^+_{ti} + l^+_{tt}) \sum_{j=1}^{n} a_{ij} c_{ij}.$$

From this we can obtain the commute cost distance $\Delta^{\mathrm{CC}}_{st}$ by symmetrization:

$$\begin{aligned}
\Delta^{\mathrm{CC}}_{st} &= o_{st} + o_{ts} \\
&= \sum_{i=1}^{n} (l^+_{si} - l^+_{st} - l^+_{ti} + l^+_{tt} + l^+_{ti} - l^+_{ts} - l^+_{si} + l^+_{ss}) \sum_{j=1}^{n} a_{ij} c_{ij} \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} (l^+_{ss} + l^+_{tt} - l^+_{st} - l^+_{ts}) \, a_{ij} c_{ij} \\
&= (l^+_{ss} + l^+_{tt} - 2 l^+_{st}) \sum_{i,j=1}^{n} a_{ij} c_{ij},
\end{aligned}$$

which holds because the graph is assumed undirected. Comparing this result with
Equation (11) we see that the distances only differ from each other by a multiplying factor.
Moreover, we see that this factor is

$$\frac{\Delta^{\mathrm{CC}}_{st}}{\Delta^{\mathrm{CT}}_{st}} = \frac{\sum_{i,j=1}^{n} a_{ij} c_{ij}}{\sum_{i,j=1}^{n} a_{ij}} = \frac{e^T (A \circ C) e}{e^T A e}.$$
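
The proportionality can also be checked numerically without using the pseudoinverse formulas, by solving the first passage equations of the random walk directly. The sketch below is such an independent check under assumptions: the helper `expected_hitting_cost`, the small weighted graph and its cost matrix are hypothetical examples.

```python
import numpy as np

def expected_hitting_cost(P, step_cost, t):
    """Expected accumulated cost until a random walk with transitions P first reaches t."""
    n = P.shape[0]
    keep = [k for k in range(n) if k != t]
    b = (P * step_cost).sum(axis=1)      # expected cost of one step from each node
    o = np.zeros(n)
    # First passage equations: o_k = b_k + sum_{j != t} p_kj o_j, with o_t = 0.
    o[keep] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(keep, keep)], b[keep])
    return o

# A small weighted undirected graph with symmetric edge costs (hypothetical values).
A = np.array([[0, 2, 0, 0],
              [2, 0, 1, 3],
              [0, 1, 0, 1],
              [0, 3, 1, 0]], dtype=float)
C = np.array([[0, 1, 0, 0],
              [1, 0, 4, 2],
              [0, 4, 0, 3],
              [0, 2, 3, 0]], dtype=float)
n = A.shape[0]
P = A / A.sum(axis=1, keepdims=True)

first_passage_cost = np.zeros((n, n))
first_passage_time = np.zeros((n, n))
for t in range(n):
    first_passage_cost[:, t] = expected_hitting_cost(P, C, t)
    first_passage_time[:, t] = expected_hitting_cost(P, np.ones_like(C), t)

commute_cost = first_passage_cost + first_passage_cost.T
commute_time = first_passage_time + first_passage_time.T
factor = (A * C).sum() / A.sum()         # e^T (A o C) e / e^T A e
```

For every pair of nodes, `commute_cost` should equal `factor * commute_time`, and `commute_time` should agree with the pseudoinverse expression of Equation (11).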


